Word | Frequency | Number of right neighbors | Ratio |
---|---|---|---|
Finalment | 256 | 1 | 256.0000 |
Noticies | 240 | 1 | 240.0000 |
Diari | 929 | 4 | 232.2500 |
Halloween | 207 | 1 | 207.0000 |
obstant | 414 | 2 | 207.0000 |
Castanyada | 199 | 1 | 199.0000 |
Organitzes | 199 | 1 | 199.0000 |
còpia | 198 | 1 | 198.0000 |
col·lectiva | 191 | 1 | 191.0000 |
concret | 188 | 1 | 188.0000 |
cookies | 181 | 1 | 181.0000 |
oferir-te | 180 | 1 | 180.0000 |
Archive | 178 | 1 | 178.0000 |
Blog | 178 | 1 | 178.0000 |
coma | 178 | 1 | 178.0000 |
MissatgeEnviar | 175 | 1 | 175.0000 |
creu | 338 | 2 | 169.0000 |
entradas | 167 | 1 | 167.0000 |
multiples | 167 | 1 | 167.0000 |
Separe | 167 | 1 | 167.0000 |
On average, words with higher frequency have more co-occurrences, but there are exceptions. The following subsections look for those exceptions. First we look for words that have few right co-occurrences. As a measure we use the ratio of frequency divided by the number of right neighbors, so a high ratio marks a frequent word that is followed by very few distinct words.
Depending on the language, we find words which co-occur mostly with the same partners. This may happen for multi-word proper names and words before punctuation marks.
Due to tokenization rules, some full stops are removed from abbreviations. Hence this list can reveal abbreviations that were previously unknown to the system.
In general, larger corpora may give clearer results.
Table data:
select word, w.freq, count(c.w2_id), w.freq/count(c.w2_id) as r from words w, co_n c where w.w_id=w1_id and w1_id>100 group by w1_id order by r desc limit 20;
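The table query above can be tried on a toy database. The following is a minimal sketch, assuming a schema inferred only from the column names in the query: `words(w_id, word, freq)` and `co_n(w1_id, w2_id)`. The sample words and the function name `right_neighbor_ratios` are hypothetical. The sketch uses SQLite, where dividing two integers truncates, so the ratio is multiplied by `1.0`; this differs from the original query, as does the explicit join and the fuller `GROUP BY`.

```python
import sqlite3


def right_neighbor_ratios(conn, min_id=100, limit=20):
    """Return (word, freq, n_right, ratio) rows, highest ratio first."""
    query = """
        SELECT w.word,
               w.freq,
               COUNT(c.w2_id) AS n_right,
               1.0 * w.freq / COUNT(c.w2_id) AS r  -- force float division
        FROM words w
        JOIN co_n c ON w.w_id = c.w1_id
        WHERE c.w1_id > ?
        GROUP BY w.w_id, w.word, w.freq
        ORDER BY r DESC
        LIMIT ?
    """
    return conn.execute(query, (min_id, limit)).fetchall()


# Toy data (hypothetical): word ids start above 100 to pass the id filter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (w_id INTEGER PRIMARY KEY, word TEXT, freq INTEGER);
    CREATE TABLE co_n (w1_id INTEGER, w2_id INTEGER);
""")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [(101, "alpha", 200), (102, "beta", 90), (103, "gamma", 30), (104, "delta", 10)],
)
conn.executemany(
    "INSERT INTO co_n VALUES (?, ?)",
    [(101, 102), (102, 101), (102, 103), (102, 104), (103, 101), (103, 102)],
)

rows = right_neighbor_ratios(conn)
for word, freq, n_right, ratio in rows:
    print(f"{word} | {freq} | {n_right} | {ratio:.4f}")
```

Words without any right neighbor, such as `delta` in the toy data, do not appear in the result because the join drops them, which also avoids a division by zero.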
Diagram data:
select w.freq, count(c.w2_id) from words w, co_n c where w.w_id=w1_id and w1_id>100 group by w1_id;
Some diagrams seem to show a sharp cutoff for frequencies below 100. This seems to be a bug.
How can we calculate the mean slope of the diagram as a language constant?
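One possible answer to this question, offered only as a sketch and not as the method used for these pages, is an ordinary least-squares fit through the (frequency, number of neighbors) pairs returned by the diagram query. The function name `mean_slope` and the choice to fit on logarithmic axes by default are assumptions; on logarithmic axes the slope is the exponent of an assumed power-law relation between the two quantities.

```python
import math


def mean_slope(points, log_scale=True):
    """Least-squares slope of neighbor count against frequency.

    points: iterable of (freq, n_neighbors) pairs.
    log_scale: if True, fit log(n_neighbors) against log(freq).
    """
    if log_scale:
        # Logarithms are only defined for positive values.
        pts = [(math.log(f), math.log(n)) for f, n in points if f > 0 and n > 0]
    else:
        pts = [(float(f), float(n)) for f, n in points]
    if len(pts) < 2:
        raise ValueError("need at least two usable points")

    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all frequencies are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Hypothetical data in which the neighbor count grows like sqrt(freq).
sample = [(4, 2), (16, 4), (64, 8), (256, 16)]
slope = mean_slope(sample)
print(f"estimated slope: {slope:.3f}")
```

Whether a single slope is stable enough to serve as a language constant would have to be checked across corpora of different sizes, since the source notes that larger corpora may give clearer results.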
5.1.7.2 Number of NN co-occurrences vs. Frequency II
5.1.7.3 Number of left vs. right NN co-occurrences